Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker's emotional state from an individual's speech. In this paper, we present major contributions for: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals, which are enhanced in prior steps to improve performance. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than pooling layers, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, while reducing the model size by 34.5 MB. These results demonstrate the effectiveness and significance of the proposed SER technique and reveal its applicability in real-world applications.
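To make the strided-convolution idea concrete, the sketch below shows a minimal stride-based CNN for spectrogram classification in PyTorch. The layer counts, channel widths, input size, and four-class output are illustrative assumptions for the sketch, not the exact configuration reported in the paper.

```python
# Minimal sketch of a stride-based CNN for spectrogram emotion classification.
# Hyperparameters (layers, channels, 128x128 input, 4 classes) are assumptions
# for illustration, not the paper's exact DSCNN configuration.
import torch
import torch.nn as nn


class StrideCNN(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Convolutions with stride 2 down-sample the feature maps directly,
        # taking the place of pooling layers ("plain nets" strategy).
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 128 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), # 32 -> 16
            nn.ReLU(inplace=True),
        )
        # Fully connected layers learn global discriminative features;
        # the final linear layer feeds the softmax over emotion classes.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of single-channel spectrograms, shape (N, 1, 128, 128)
        logits = self.classifier(self.features(x))
        # Softmax gives class probabilities at inference time; training
        # would typically apply CrossEntropyLoss to the raw logits instead.
        return torch.softmax(logits, dim=1)


if __name__ == "__main__":
    model = StrideCNN(num_classes=4)
    dummy = torch.randn(2, 1, 128, 128)  # two dummy spectrograms
    print(model(dummy).shape)            # -> torch.Size([2, 4])
```

Replacing pooling with strided convolutions lets the network learn its own down-sampling while keeping the layer stack "plain", which is the design choice the abstract attributes to the reduced model size.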